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Abstract 



We propose a new data-augmentation strategy for fully Bayesian inference in models 
with binomial likelihoods. The approach appeals to a new class of Polya-Gamma distri- 
butions, which are constructed in detail. A variety of examples are presented to show the 
power of the method, including logistic regression, negative binomial models for count 
data, nonlinear mixed-effects models, and spatial models for count data. In each case, 
our data-augmentation strategy leads to simple, effective methods for posterior infer- 
ence that: (1) circumvent the need for analytic approximations, numerical integration, 
or Metropolis-Hastings; and (2) outperform other known data-augmentation strategies, 
both in ease of use and in computational efficiency All methods are implemented in 
the R package BayesLogit. 

In the technical supplement appended to the end of the paper, we provide fur- 
ther details regarding the generation of Polya-Gamma random variables; the empirical 
benchmarks reported in the main manuscript; and the extension of the basic data- 
augmentation framework to contingency tables and multinomial outcomes. 

1 Introduction 

Many common likelihoods involve products of the form 



Two familiar cases are logistic and negative-binomial regression, where bi and a% are the 
number of trials and successes for subject i, and ip{ = xJ/3 is a linear predictor. 

In this paper, we propose a new latent-variable representation of likelihoods involving 
terms like . This leads to efficient Gibbs-sampling algorithms for a wide class of models 
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that have previously eluded simple treatment. Our method appeals to the new family of 
Polya-Gamma distributions, defined briefly here and constructed in detail in Section [2] 

Definition 1. A random variable X has a Polya-Gamma distribution with parameters 
b > and c G K, denoted X ~ PG(6, c), if 

x £ _ V 9 — (2) 

2^2 2^ (A; — 1/2) 2 + C 2 / (47T 2 ) ' 1 J 

where each ~ Ga(6, 1) is an independent gamma random variable. 

Our main result (Theorem [TJ below) states that binomial likelihoods parametrized by 
log-odds can be written as mixtures of Gaussians with respect to a Polya-Gamma distribu- 
tion. Specifically, if cj ~ PG(6, 0), b > 0, then 

( „i>\a pea 

(1 + e^) 6 y ^ v 7 

where k = a — 6/2. When ^ = x T /3 is a linear function of predictors, the full likelihood in 
(3 will therefore be conditionally Gaussian. Moreover, the full conditional distribution for 
ui, given tp, is also a Polya-Gamma distribution. This suggests a basic template for Gibbs 
sampling across a wide class of models with binomial likelihoods and conditionally Gaussian 
priors. The template is very simple, especially compared to existing methods in this realm: 
Gaussian draws for the main parameters of interest, and Polya-Gamma draws for a single 
layer of latent variables. 

Many previous approaches have been proposed for estimating Bayesian models within 
this class. This includes the Metropolis-Hastings method, along with at least half a dozen 
other latent-variable schemes that facilitate Gibbs sampling. Thus a major aim of our paper 
is to give practitioners a sense of which algorithms are best suited to given circumstances. 
We present evidence in support of two claims: 

1. In simple binomial models with abundant data and no hierarchical structure, the 
Polya-Gamma method is a close second to the independence Metropolis-Hastings sam- 
pler, as long as the proposal distribution is chosen carefully. 

2. In virtually all other cases, our Polya-Gamma approach is most efficient. 

The one exception we have encountered to the second claim is the case of a negative- 
binomial regression model with more than about 20 counts per observation, and with little 
hierarchical structure in the prior. In this case we would recommend a variant on the 



auxiliary-mixture sampling approach of Fruhwirth-Schnatter et al. (2009), with the caution 



that this method is an approximation (albeit one that simulation evidence suggests is quite 
good). For reasons that we explain below, with 20 counts per observation the difference in 
efficiency is small, but with thousands of counts per observation, it is very large. 

This caveat notwithstanding, the Polya-Gamma scheme offers real advantages, both 
in speed and simplicity, across a wide variety of structured Bayesian models for binary 
and count data. In general, the more complex the model, and the more time that one 
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must spend sampling the main parameters in a model, the larger will be the advantage of 
the method introduced here. The difference is especially notable for the Gaussian-process 
spatial models we consider, which require expensive matrix operations. We conclude that 
the Polya-Gamma method is the best choice when using any of the complicated hierarchical 
priors that now characterize much of modern applied Bayesian work. 

The paper proceeds as follows. The class of Polya-Gamma variables is introduced in 
Section 2, and used to derive a data-augmentation scheme for binomial likelihoods in Section 
3. Section 4 describes a method for simulating from the Polya-Gamma distribution, which is 
needed for Gibbs sampling. The expression in ^ suggests a naive way to approximate this 
distribution using a large number of independent gamma draws. But we describe an exact 
method, which avoids the difficulties that can result from truncating an infinite sum. Our 
method, which is implemented in the R package BayesLogit, is an accept /reject sampler 
with a carefully designed proposal based on the alternating-series method of Devroye ( 1986 ). 
For the basic PG(l,c) case, the sampler is very efficient: it requires only exponential and 
inverse-Gaussian draws, and the probability of accepting a proposed draw is uniformly 
bounded below at 0.99919 (Proposition 2). Moreover, it is fully automatic, in the sense 
that it requires no tuning by the user to get optimal performance. 

Section 5 describes the results of an extensive benchmarking study comparing the ef- 
ficiency of our method to other data-augmentation schemes. Section [6] concludes with a 
discussion of some open issues related to our proposal. Many further details of the Polya- 
Gamma sampling algorithm and our empirical study of its efficiency are deferred to a 
technical supplement. 



2 The Polya-Gamma distribution 

2.1 The case PG(6, 0) 

The key step in our approach is the construction of the Polya-Gamma distribution. Together 
with the sampling method described in Section |4j this greatly simplifies Bayesian inference 
in models with binomial likelihoods. 

The Polya-Gamma family of distributions, PG(6, z), is a subset of the class of infinite 
convolutions of gamma distributions. Our initial work focused on the PG(1, 0) distribution, 
which is a carefully chosen element of the class of infinite convolutions of exponential distri- 



butions, also know as Polya distributions ( Barndorff-Nielsen et al. 1982), that has Laplace 



transform E{exp(— uit)} = cosh~ 1 (y / t/2). More generally, one may define oj ~ PG(b,0), 
b > 0, as the infinite convolution of gamma distributions, hence the name Polya-Gamma, 
that has Laplace transform 

E{exp(-^)} - g (l + w - W - ^7P) ; (3) 
the last equality is a consequence of the Weierstrass factorization theorem. Inverting the 
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Laplace transform, one finds that uj ~ PG(6, 0) may be defined by 



d l gk 
° " 2- ^ 



where g\. are independent Gamma(6, 1) random variables. 

The PG(6, 0) class of distributions is closely related to a subset of distributions, which 



we denote by J*(b), b > 0, that are surveyed by Biane et al. (2001) and that have Laplace 
transforms 

E{ e -* J *( 6 )} = cosh- 6 (V2t) . (4) 

These have close connections with the Jacobi Theta and Riemann Zeta Functions, and with 
Brownian excursions. They are related to the Polya-Gamma family by PG(6, 0) = J*(6)/4. 

2.2 The general PG(6, c) class 

The general PG(6, c) class arises through an exponential tilting of the PG(b, 0) density, much 
in the same way that a Gaussian likelihood combines with a Gamma prior for a precision. 
Specifically, a PG(6, c) random variable has the probability density function 



exp p{uj | b, 0) 

pMM = —r ' 2 s , (5) 

E w |exp(-^-w) j 

where p{uj \ b,0) is the density of a PG(6, 0) random variable. The expectation in the 
denominator is taken with respect to the PG(6, 0) distribution and is thus cosh _& (c/2) by 
Q, ensuring that p(u \ b,c) is a valid density. 

The Laplace transform of a PG(fe, c) distribution may be calculated by appealing to the 
Weierstrass factorization theorem again: 

E u {exp (-ujt)} = -. } 2J . (6) 




OO , v -l 

JJ(1 + <T k H)- b , where d k = 2\k--j tt 2 + c 2 /2 . 



Each term in the product is recognizable as the Laplace transform of a gamma distribu- 
tion. We can therefore write a PG(6, c) as an infinite convolution of gamma distributions, 

Ga(6,l) 1 ^ Ga(M) 

UJ 



d Ga(6, 1) 1 



which is the form given in Definition 1. 
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2.3 Further properties 

The density of a Polya-Gamma random variable can be expressed as an alternating-sign sum 



of inverse- Gaussian densities. From the characterization of J* (b) density given by Biane 



et al. (2001), we know that the PG(b,0) distribution has density 

f( x \bo) = 2 —T(-ir T{n+b) (2n+ 
m zJ L ) T(n+1) 



The density of PG(b, z) distribution is then computed by an exponential tilt and a renor- 
malization: 



/ x b,c) = {cosh" (c/2)}— 2^(-l ra ' y -e b. 

r W r(n+ 1) V2vrx 3 



Notice that the normalizing constant is known directly from the Laplace transform of the 
PG(6,0). 

A further useful fact is that all finite moments of a Polya-Gamma random variable are 
available in closed form. In particular, the expectation may be calculated directly. This 
allows the Polya-Gamma scheme to be used in EM algorithms, where the latent w's will 
form a set of complete-data sufficient statistics for the main parameter. We arrive at this 
result by appealing to the Laplace transform of oo ~ PG(6, c). Differentiating ([6]) with 
respect to t, negating, and evaluating at zero yields 

E(w) = — tanh(c/2) 



2c w ' 2c \1 + e c 1 + e 

Lastly, the Polya-Gamma class is closed under convolution for random variates with 
the same tilting parameter. If ui\ ~ PG(&i,z) and UJ2 ~ PG(b2,z) are independent, then 
u\ + 0J2 ~ PG{b\ + 62 j This follows from the Laplace transform. We will employ this 
property later when constructing a Polya-Gamma sampler. 

3 The data- augment at ion strategy 
3.1 Main result 

The Polya-Gamma family has been carefully constructed to yield the following result. 

Theorem 1. Let p(oo) denote the density of the random variable 00 ~ PG(6, 0), b > 0. 
Then the following integral identity holds for all a € M: 



/ p i>\a roc 



(1 + e^f Jo 

where k = a — 6/2. 



(7) 
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Moreover, the conditional distribution 



p(ui | V) 



which arises in treating the integrand in ([7]) as an unnormalized joint density in (tp,uj), is 
also in the Polya-Gamma class: {uj \ ip) ~ PG(6, ip). 

Proof. Appealing to ([3]), we may write the likelihood in Q as 

(e^) a _ 2- b exp{Kip} 
(1 + e^) 6 ~ cosh b (V>/2) 

= 2- b e^ E a; {exp(- W V 2 /2} , 

where the expectation is taken with respect to to ~ PG(6, 0), and where k = a — 6/2. 
Turn now to the conditional distribution 



p(u | V) 



e -^ 2 /2 p ( w ) 



J °° e~^ 2 /2 p ( w ) da; ' 



where p(oj) is the density of the prior, PG(6, 0). This is of the same form as ([5]), with ip = c. 
Therefore (u \ V>) ~ PG(6, V)- 

□ 

The utility of this construction is most easily understood in the case of the binomial 
logit model. Let yi G {0, 1} be a binary outcome for unit i, with corresponding predictors 
x% = (xn, . . . , Xi p ). Let ipi = xj '(3 be the log odds that yi = 1. Appealing to Theorem [TJ 
the contribution of yi to the likelihood in (3 may be written as 



{exp(af/3)>* 
1 + exp(xf (3) 

oc exp(Kixff3) / exp{-Ui(xfp) 2 / 2} p(uji \ 1,0) , 
Jo 



where m = y% — 1 /2, and where p(u}% | 1, 0) is the density of a Polya-Gamma random variable 
with parameters (1,0). 

Combining all n terms gives the following expression for the conditional likelihood in (3, 
given u = (ui, . . . ,u n ): 

n n 

L{j3 | lj) = Y[ L i(P I u i) « Y[exp{KixJ/3-LJi(xJl3) 2 /2} 

i=l 
n 

II eXP { Y - Ki/Wi) 2 } 



i=l i=l 



OC 

=1 



oc exp<j -^(z- Xf3) T n(z- X(3) 
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Random Intercept by District 
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Figure 1: Marginal posterior distribution of random intercepts for each district found in a 
Bangladeshi contraception survey. One may easily alter a Gibbs sampler built for binomial 
logistic regression to accomodate a binomial logistic mixed model. For 10,000 samples after 
2,000 burn-in, median ESS=8168 and median ESR=59.88 for the PG method. 




where z = (ki/ui, . . . , K n /ui n ), and where Q = diag(wi, . . . , w n ). Given all Wj terms, we 
therefore have a conditionally Gaussian likelihood in (3, with pseudo-response vector z, 
design matrix X, and covariance matrix 

Supposing that (3 ~ N(6, B), one may sample from the joint posterior distribution for 
(3 by iterating two simple Gibbs steps: 

(uh |/3) ~ PG(l,xf/3) 
(J3\y,u) ~ N(m, V) , 

where 

v = (x T nx + B- 1 )- 1 

m = C{X T ttz + B~ l b) , 
recalling that z% = ui^ l {yi — 1/2). 



3.2 Mixed Model Example 

We have introduced the Polya-Gamma method in the context of a binary logit model. 
We do this with the understanding that, when data are abundant, the Metropolis-Hastings 
algorithm with independent proposals is likely to be the most efficient method, as asymptotic 
theory suggests that a normal approximation to the posterior distribution will become very 
accurate as data accumulate. This is well understood among Bayesian practitioners (e.g. 



Carlin, 1992 Gelman et al. 2004). 



But the real advantage of data augmentation, and the Polya-Gamma technique in par- 
ticular, is that it becomes easy to construct and fit more complicated models. For instance, 
the Polya-Gamma method trivially accommodates mixed models, factor models, and mod- 
els with a spatial or dynamic structure. For most problems in this class, good Metropolis- 
Hastings samplers are very difficult to design, and at the very least will require ad-hoc 
tuning to yield good performance. 

Several relevant examples are considered in Section 5. But as an initial illustration of 
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the point, we fit a binomial logistic mixed model using the data on contraceptive use among 



Bangladeshi women provided by the R package mlmRev (Bates et al. , 2011 ). The data comes 



from a Bangladeshi survey whose predictors include a woman's age, the number of children 
at the time of the survey, whether the woman lives in an urban or rural area, and a more 
specific geographic identifier based upon the district in which the woman resides. Some 
districts have few observations and district 54 has no observations; thus, a mixed model is 
necessary if one wants to include this effect. The response identifies contraception use. We 
fit the mixed model 



yij ~ Binom(l,py) , pij 

tpij = m + 5j + x'ij/3, 

5j ~ iV(0,l/$, 

m ~ N(0,k 2 /4>), 



1 + e^J ' 



where i and j correspond to the ith observation from the jth district. The fixed effect (3 is 
given a N(0, 100/) prior while the precision parameter cj> is given Ga(l, 1) prior. We take 
k — >• oo to recover an improper prior for the global intercept m. Figure [l] shows the box 
plots of the posterior draws of the random intercepts m + 6j . If one does not shrink these 
random intercepts to a global mean using a mixed model, then several take on unrealistic 
values due to the unbalanced design. 

We emphasize that there are many ways to model this data, and that we do not intend 
our analysis to be taken as definitive. It is merely a proof of concept, showing how various 
aspects of Bayesian hierarchical modeling — in this case, a models with both fixed and ran- 
dom effects — can be combined routinely with binomial likelihoods using the Polya-Gamma 
scheme. Together these changes require just a few lines of code and a few extra seconds of 
runtime compared to the binomial logit case. A posterior draw of 2,000 samples for this 
data set takes 26.1 seconds for a binomial logistic regression, versus 27.3 seconds for a bino- 
mial logistic mixed model. As seen in the negative binomial examples below, one may also 
painlessly incorporate a more complex prior structure using the Polya-Gamma technique. 
For instance, if given information about the geographic location of each district, one could 
place spatial process prior upon the random offsets {Sj}- 

3.3 Existing data-augmentation schemes 



A comparison with Holmes and Held (2006) and Fruhwirth-Schnatter and Fruhwirth (2010) 



clarifies how the Polya-Gamma method differs from previous attempts at data augmenta- 



tion. Both of these papers rely on a hierarchy like that of Albert and Chib (1993b), where 



the outcomes yi are assumed to be thresholded versions of an underlying continuous quantity 



Zi > 
Zi < 



z. 



xjf3 + €i , €i ~ Lo(l) 



(8) 
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where ej ~ Lo(l) has a standard logistic distribution. Upon marginalizing over Zj, the origi- 
nal binomial likelihood is recovered. Albert and Chib represent the probit regression model 
in the same way, subject to the modification that ej has a standard normal distribution. 

The formulation ([8]) does not lead directly to an easy method for sampling from the 
posterior distribution of /3. This differs from the probit case, where one may exploit the 
conditional normality of the likelihood in /3, given z%. To respond to this difficulty, one 
may add another layer of auxiliary variables to represent the logistic error as a normal-scale 



mixture (Holmes and Held, 2006): 



(ej I 4>i) 



N(0,tf 
(2A, 2 ) 



Ai~KS(l), 



where A« has a Kolmogorov-Smirnov distribution (Andrews and Mallows, 1974). Alter 



natively, one may approximate the logistic error term as a discrete mixture of normals 



(Fruhwirth-Schnatter and Friihwirth, 2010): 



(ei|&) ~ N(0,&) 

K 

(pi ~ ^2 w k S^ k ) , 

k=l 

where 5<p indicates a Dirac measure at (f>. The weights Wk and the points 4>^ in the discrete 
mixture are fixed for a given choice of K so that the Kullback-Leibler divergence from the 
true distribution of the random utilities is minimized. IFriihwirth-Schnatter and Friihwirthl 
(2010) find that the choice of K = 10 leads to a good approximation, and list the optimal 



weights and variances for this choice. 

In both cases, posterior sampling can be done in two blocks, sampling the complete 
conditional of (3 in one block and sampling the joint complete conditional of both layers of 
auxiliary variables in the second block. The discrete mixture of normals is an approximation, 
but it outperforms the scale mixture of normals in terms of effective sampling rate, as it is 
much faster. 

One may also arrive at the hierarchy above by manipulating the random utility-derivation 



of McFadden (1974) by considering the difference of random utilities, or "dRUM," using 



the term of Fruhwirth-Schnatter and Friihwirth (2010). The dRUM representation is supe 



rior to the random utility approach explored in Friihwirth-Schnatter and Friihwirth (2007). 



Further work by Fussl et al. (2011) improves the approach for binomial logistic models. In 



this extension, one must use a table of different weights and variances representing different 
normal mixtures, to approximate a finite collection of type-Ill logistic distributions, and 
interpolate within this table to approximate the entire family. 



An alternative is the method of O'Brien and Dunson (2004) for multiple categories, 



which is related to the original work of Albert and Chib (1993b) for binomial logistic 
regression. Both ultimately suggest that a particular choice of Students link function 
makes a good approximation to the logistic link. However, this approximation has two 
layers of latent variables like the other data augmentation techniques discussed above. 
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i = 1 , . . . , 



Figure 2: Directed acyclic graphs depicting two latent-variable constructions for the 



logistic-regression model: the difference of random-utility model of Holmes and Held 



(2006) and Fruhwirth-Schnatter and Friihwirth (2010), on the left; versus our direct data- 



augmentation scheme, on the right. 



Our data-augmentation scheme differs from each of these approaches in several ways. 
First, it does not appeal directly to the random-utility interpretation of the logit model. 
Instead, it represents the logistic CDF as a mixture with respect to an infinite convolution of 
gammas. Second, the method is exact, in the sense of making draws from the correct joint 
posterior distribution, rather than an approximation to the posterior. Third, it requires 
only a single layer of latent variables. Evidence in Section 5 suggests that the last point 
confers an advantage in terms of efficiency. 

A similar approach to ours is that of Gramacy and Poison (2012b), who propose a latent- 
variable representation of a powered-up version of the logit likelihood. This representation 
is very useful for obtaining classical penalized-likelihood estimates via simulation. But the 
difficulty with performing fully Bayesian inference using this representation is that it leads 
to an improper mixing distribution for the latent variable. This requires modifications of 
the basic approach that make simulation very challenging in the general logit case. As our 
experiments show, the method does not seem to be competitive on speed grounds with the 
Polya-Gamma representation, which results in a proper mixing distribution for all common 
choices of 04, hi in ([!]) . 

For negative-binomial regression, Fruhwirth-Schnatter et al. ( 2009 ) employ the discrete- 
mixture/table-interpolation approach, like that used by Fussl et al. (2011), to produce a 
tractable data augmentation scheme. In some instances, the Polya-Gamma approach out 
performs this method; in others, it does not. The reasons for this discrepancy can be 
explained by examining the inner workings of our Polya-Gamma sampler, to which we now 
turn. 



4 Simulating Polya-Gamma random variables 
4.1 The PG(l,z) sampler 

All our developments thus far require an efficient method for sampling Polya-Gamma ran- 
dom variates. In this section, we derive such a method, which is implemented in the R 
package BayesLogit. 

One may sample Polya-Gamma random variables naively (and approximately) using the 
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sum-of-gammas representation in Equation ([2]). But this is slow, and involves the poten- 
tially dangerous step of truncating an infinite sum. We therefore construct an alternate, 
exact method by extending the approach of Devroye (2009) for simulating J*(l) from Q. 
The distribution J*(l) is related to the Jacobi theta function, so we call J*(l) the Jacobi 
distribution. One may define an exponentially tilted Jacobi distribution J* (l,z) via the 
density 



cosh(z) 



fix) 



(9) 



where f(x) is the density of J*(l). The PG(1, z) distribution is related to J*(l, z) through 
the rescaling 



PG(l,z) = -J*(M/2). 



(10) 



Devroye (2009) develops an efficient J* (1,0) sampler. Following this work, we develop 
an efficient sampler for an exponentially tilted J* random variate, J*(l,z). In both cases, 
the density of interest can be written as an infinite, alternating sum that is amenable to von 
Neumann's alternating sum method. Recall that a random variable with density / may be 
sampled by the accept/reject algorithm by: (1) proposing X from a density g; (2) drawing 
U ~ U(0,cg(X)) where ||//<?||oo < c; and (3) accepting X if U < f(X) and rejecting X 
otherwise. When f(x) = Y^=oi~^) n a n{%) and the coefficients a n {x) are decreasing for all 
n G No, for fixed x in the support of /, then the partial sums, S n (x) = X^=o( — ^Y a i( x )i 
satisfy 

S (x) > S 2 (x) >■■■> f(x) >■■■> S 3 (x) > 5i(s). (11) 

In that case, step (3) above is equivalent to accepting X if U < Si(X) for some odd i, 
and rejecting X if U > Si(X) for some even i. Moreover, the partial sums Si(X) can be 
calculated iteratively. Below we show that for the J*(l,z) distribution the algorithm will 
accept with high probability upon checking U < S\(X). 

The Jacobi density has two alternating-sum representations, Y^=oi~^-) n a n{ x ) an d Y^=q(~ 1 



) n a?(x) 



neither of which satisfy (11) for all x in the support of /. However, each satisfies (11) on 



an interval. These two intervals, respectively denoted It and Ir, satisfy II U Ir = (0, 00) 
and It D Ir ^ 0. Thus, one may pick t > and define the piecewise coefficients 



a n (x) 



> x ( 2 \ 3/2 f 2(n + l/2) 2 l . . 

vr(n + l/2 — exp<^ ^ 0<x<t, (12) 

\7TX J I X J 

f (n + l/2) 2 vr 2 1 , . 

vr(n + l/2) exp<^ j—*- x\ x > t, (13) 



so that f(x) = ^^Lo( — ^) n<1 n(x) satisfies the partial sum criterion (11) for x > 0. Devroye 
shows that the best choice of t is near 0.64. 

Employing Q, we now see that the J*(l,z) density can be written as an infinite, 
alternating sum f(x\z) = Yl^=o(~ ^-) n a n{x\z) , where 



a n (x\z) = cosh(2;) exp <j - : — ^ \ a n (x), 



2 

z X 



that satisfies (11) as a n+ \{x\z) / a n {x\z) = a n+ i(x)/a n (x). Since ao(x\z) > f(x\z), the first 
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term of the series provides a natural proposal: 



71" 

c(z) g(x\z) = — cosh(z) < 



2 y/ 2 r z*x i , 



(14) 

exp <! — I — + '— )x \ , x > t. 



Z 2 IT 2 



Examining these two kernels, one finds that X ~ g(x\z) may be sampled from a mixture of 
an inverse-Gaussian and an exponential: 



X 



'lG{\z\-\ l)l m with prob. p/(p + q) 

Ex(-z 2 /2 + 7r 2 /8)I (ti00 ) with prob. q/(p + q) 



where p(z) = J* c(z) g(x\z)dx and q(z) = c(z) g(x\z)dx. Note that we are implicitly 
suppressing the dependence of p, q, c, and g upon t. 

With this proposal in hand, sampling J*(l, z) proceeds as follows: 

1. Generate a proposal X ~ g(x\z). 

2. Generate U ~ W(0, c(z)c/(X|z)). 

3. Iteratively calculate S n (X\z), starting at Si(X\z), until U < S n (X\z) for an odd n or 
until C7 > S n (X\z) for an even n. 

4. Accept X if n is odd; return to step 1 if n is even. 

To sample Y ~ PG(1, z), draw X ~ J*(l, z/2) and then let Y = X/4. The details of the 
implementation, along with pseudocode, can be found in the technical supplement. 

4.2 Analysis of acceptance rate 



The J*(l,z) sampler is very efficient. The parameter c = c(z,t) found in (14) characterizes 
the average number of proposals we expect to make before accepting. Devroye shows that 
in the case of z = 0, one can pick t so that c(0, t) is near unity. We extend this result to 
non-zero tilting parameters and calculate that, on average, the J*(l,z) sampler rejects no 
more than 9 out of every 10,000 draws, regardless of z. 



Proposition 2. Define 



p( - . / ) / | cosh(z) exp \- Z ~y \ ao(x)>lx. 



o 



z 2 x 



q( z ,t) = J -cosh(z)exp| — — ja Q (x)dx. 
The following facts about the Polya-Gamma rejection sampler hold. 
1. The best truncation point t* is independent of z > 0. 
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2. For a fixed truncation point t, p(z, t) and q(z, t) are continuous, p(z, t) decreases to 
zero as z diverges, and q(z, t) converges to 1 as z diverges. Thus c(z, t) = p(z, t)+q(z, t) 
is continuous and converges to 1 as z diverges. 

3. For fixed t, the average probability of accepting a draw, l/c(z,t), is bounded below 
for all z. For t*, this bound to five digits is 0.99919, which is attained at z ~ 1.378. 

Proof. We consider each point in turn. Throughout, t is assumed to be in the interval of 
valid truncation points, 1^ D Ir. 

1. We need to show that for fixed z, c(z, t) = p(z, t) + q(z, t) has a maximum in t that is 
independent of z. For fixed z > 0, p(z, t) and q(z, t) are both differentiable in t. Thus 
any extrema of c will occur on the boundary of the interval I^f] Ir, or at the critical 
points for which §f = 0; that is t G II D Ir for which 



2 

cosh(z) exp { - Z -t} [a^{t) - a R (t)\ = 0. 



The exponential term is never zero, so an interior critical point must satisfy ag (t) — 
a o(t) = 0' which is independent of z. Devroye shows there is one such critical point, 
t* ~ 0.64, and that it corresponds to a maximum. 

2. Both p and q are integrals of recognizable kernels. Rewriting the expressions in terms 
of the corresponding densities and integrating yields 

. 7T 1 f I Z 2 7T 2 

p(ir,t) = cosh(z)-^y exp| - y{z)tj, y(z) = — + — , 

and 

q(z,t) = (l + e- 2z )^> IG (t\l/z,l) 

where $/g is the cumulative distribution function of an IG(l/z, 1) distribution. 

One can see that p(z, t) is eventually decreasing in z for fixed t by noting that the 
sign of |j is determined by 

tanh(z) - ^ Z ^ - zt , 

T + "8" 

which is eventually negative. (In fact, for the t* calculated above it appears to be 
negative for all z > 0, which we do not prove that here.) Further, p(z, t) is continuous 
in z and converges to as z diverges. 

To see that q(z,t) converges to 1, consider a Brownian motion (W s ) defined on the 
probability space (f2, J 7 , P) and the subsequent Brownian motion with drift X% = 
zs + W s . The stopping time T z = inf{s > 0|X| > 1} is distributed as IG(l/z, 1) and 
W{T Z <t) = P(max sg[0it] X z s > 1). Hence P(T 2 < t) is increasing and lim z ^ 00 P(T 2 < 
t) = 1, ensuring that q(z,t) oc (1 + e~ 2z )F(T z < t) converges to 1 as z — > oo as 
well. Continuity follows by considering the cumulative distribution F(T Z < t) = 
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$((zt - - exp(20t)$((-l - zt)/Vt), which is a composition of continuous 

functions in z. 

By the continuity and tail behavior of p and q, it follows that c(z, t) = p(z, t) + q(z, t), 
for fixed i, is continuous for all z and converges to 1 as z diverges. Further c(z, t) > 1 
since the target density and proposal density satisfy f(x\z) < c(^,t)g(a?|2) for all 
x > 0. Thus, c takes on its maximum over z. 

3. Since, for each t, c(z,t) is bounded above in z, we know that l/c(z,t) is bounded 
below above zero. For t* , we numerically calculate that l/c(z, t*) attains its minimum 
0.9991977 at z ~ 1.378; thus, l/c(z,t*) > 0.99919 suggesting that no more than 9 of 
every 10,000 draws are rejected on average. 

□ 

Since t* is the best truncation point regardless of z, we will assume that the truncation 
point has been fixed at t* and suppress it from the notation. 



4.3 Analysis of tail probabilities 

Proposition [2] tells us that the sampler rarely rejects a proposal. However, the algorithm 
might calculate many terms in the sum before deciding to accept or reject, and the sampler 
would be slow despite rarely rejecting. 

Happily, this is not the case. Suppose one samples X ~ J*(l,z). Let N denote the 
total number of proposals made before accepting and let L n be the number of partial 
sums Si,i = 1 . . . , L n that are calculated before deciding to accept or reject proposal 



n < N. A variant of theorem 5.1 from Devroye (1986) employs Wald's equation to show 
that that E[^^ =1 L n ] = YliLo fo° a i( x \ z )dx- For the worst enclosing envelope, z ~ 1.378, 
E[iV] = 1.0016; that is, on average, one rarely calculates anything beyond Si of the first 
proposal. A slight alteration of this theorem gives a more precise sense of how many terms 
in the partial sum must be calculated. 

Proposition 3. When sampling X ~ J*(l,z), the probability of deciding to accept or 
reject upon checking the nth partial sum S n , n > 1, is 



1 



c(z) Jo 



{a n -i(x\z) — a n (x\z)} dx. 



Proof. Let L denote the number of partials sums that are calculated before accepting or 
rejecting the proposal. That is, a proposal, X, is generated; U is drawn from U(0, ao(X\z)); 
and L is the smallest natural number n £ N for which U < S n if n is odd or U > S n if 
n is even, where S n denotes S n (X\z). But since L is the smallest n for which this holds, 
Sl-2 < U < Sl when L is odd and Sl < U < Sl-2 when L is even. Thus, the algorithm 
accepts or rejects if and only if U & Kl(X\z) where 



K n (x\z) 



(S n - 2 (x\z), S n (x\z)], odd n 
(S n (x\z),S n - 2 {x\z)], even n. 
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In either case, |K ra (x|z)| = a n -\(x\z) — a n (x\z). Thus 

\ a n -i(x\z) — a n (x\z) 
F(L = nX = x)= v 1 ; ; , " v 1 . 

a {x\z) 

Marginalizing over x yields 



1 f°° 

P(L = n) = — - / {a„_i(x|z) - a n (x\z)}dx. 
c{z) Jo 



□ 



Since each coefficient a n is the piecewise composition of an inverse Gaussian kernel and 
an exponential kernel, these integrals may be evaluated. In particular, 



a n {x\z) = cosh(z) < 



2e-( 2n+1 > PlG {x\ti n (z),\ n ), x<t 



where n n (z) = \ n = (2n + l) 2 , y n (z) = 0.5(z 2 + (n + l/2) 2 vr 2 ), and p IG and p £ are 

the corresponding densities. The table below shows the first several probabilities for the 
worst case envelope, z ~ 1.378. Clearly F(L > n) decays rapidly with n. 

n 1 2 3 4 

V(L > n) 8.023 x 1(T 4 1.728 x 8.213 x 1CT 18 8.066 x 10" 29 

Together with Proposition 2, this provides a strong guarantee of the efficiency of the PG(l,z) 
sampler. 

4.4 The general PG(b,z) case 

To sample from the entire family of PG(6, z) distributions, we exploit the additivity of the 
Polya-Gamma class. In particular, when b G N, one may sample PG(6, z) by taking b i.i.d. 
draws from PG(l,z) and summing them. In binomial logistic regression, one will always 
sample PG(6, z) using integral b. This will also be the case in negative- binomial regression 
if one chooses an integer over-dispersion parameter. In the technical supplement, we discuss 
the case of non-integral b. 

The run-time of the latent-variable sampling step is therefore roughly linear in the 
number of total counts in the data set. For example, to sample 1 million P61ya-Gamma(l,l) 
random variables took 0.70 seconds on a dual-core Apple laptop, versus 0.17 seconds for 
the same number of Gamma random variables. By contrast, to sample 1 million PG(10,1) 
random variables required 6.43 seconds, and to sample 1 million PG(100,1) random variables 
required 60.0 seconds. It is an open question as to whether there exists a faster method to 
simulate from the PG(n,c) distribution that does not require summing together n PG(l,c) 
draws. One interesting possibility that we are exploring is to use graphical-processing units 
to sample many PG(l,c)'s in parallel. This is an active subject of research. 
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5 Experiments 



We benchmarked the Polya-Gamma method against several alternatives for logit and negative- 
binomial models. Our purpose is to summarize the results presented in detail in our online 
technical supplement, to which we refer the interested reader. 

Our primary metrics of comparison are the effective sample size and the effective sam- 
pling rate, defined as the effective sample size per second of runtime. The effective sampling 
rate quantifies how rapidly a Markov-chain sampler can produce independent draws from 



the posterior distribution. Following Holmes and Held (2006), the effective sample size 



(ESS) for the ith parameter in the model is 



ESS i = M/{l + 2j2PiU)}, 



where M is the number of post-burn-in samples, and pi(j) is the jth autocorrelation of /3\ 



We use the coda package (Plummer et al. , 2006 ), which fits an AR model to approximate the 



spectral density at zero, to estimate each ESSi. All of the benchmarks are generated using 
R so that timings are comparable. Some R code makes external calls to C. In particular, 
the Polya-Gamma method calls a C routine to sample the Polya-Gamma random variates, 
just as R routines for sampling common distributions use externally compiled code. Here 
we report the median effective sample size across all parameters in the model. Minimum 
and maximum effective sample sizes are reported in the technical supplement. 
Our numerical experiments support several conclusions. 



In binary logit models. First, the Polya-Gamma is more efficient than all previously 
proposed data-augmentation schemes. This is true both in terms of effective sample size 
and effective sampling rate. Table [T] summarizes the evidence: across 6 real and 2 sim- 
ulated data sets, the Polya-Gamma method was always more efficient than the next-best 
data-augmentation scheme (typically by a factor of 200%-500%). This includes the ap- 



proximate random- utility methods of O'Brien and Dunson (2004) and Friihwirth-Schnatter 



and Fruhwirth (2010), and the exact method of Gramacy and Poison (2012b). Friihwirth- 



Schnatter and Fruhwirth (2010) find that their own method beats several other competi- 



tors, including the method of Holmes and Held (2006). We find this as well, and omit these 



timings from our comparison. Further details can be found in Section 3 of the technical 
supplement. 

Second, the Polya-Gamma method always had a higher effective sample size than the 
two default Metropolis samplers we tried. The first was a Gaussian proposal using Laplace's 
approximation. The second was a multivariate t§ proposal using Laplace's approximation 



to provide the centering and scale- matrix parameters, recommended by Rossi et al. (2005b) 



and implemented in the R package bayesm (Rossi, 2012). 



On 5 of the 8 data sets, the best Metropolis algorithm did have a higher effective 
sampling rate than the Polya-Gamma method, due to the difference in run times. But this 
advantage depends crucially on the proposal distribution, where even small perturbations 
can lead to surprisingly large declines in performance. For example, on the Australian 
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Table 1: Summary of experiments on real and simulated data for binary logistic regression. ESS: the 
median effective sample size for an MCMC run of 10,000 samples. ESR: the median effective sample rate, 
or median ESS divided by the runtime of the sampler in seconds. AC: Australian credit data set. GC1 
and GC2: partial and full versions of the German credit data set. Siml and Sim2: simulated data with 
orthogonal and correlated predictors, respectively. Best RU-DA: the result of the best random-utility data- 
augmentation algorithm for that data set. Best Metropolis: the result of the Metropolis algorithm with the 
most efficient proposal distribution among those tested. See the technical supplement for full details. 

Data set 







Nodal 


Diab. 


Heart 


AC 


GC1 


GC2 


Siml 


Sim2 


ESS 


Polya-Gamma 


4860 


5445 


3527 


3840 


5893 


5748 


7692 


2612 




Best RU-DA 


1645 


2071 


621 


1044 


2227 


2153 


3031 


574 




Best Metropolis 


3609 


5245 


1076 


415 


3340 


1050 


4115 


1388 


ESR 


Polya-Gamma 


1632 


964 


634 


300 


383 


258 


2010 


300 




Best RU-DA 


887 


382 


187 


69 


129 


85 


1042 


59 




Best Metropolis 


2795 


2524 


544 


122 


933 


223 


2862 


537 



Table 2: Summary of experiments on real and simulated data for binary logistic mixed models. Metropolis: 
the result of an independence Metropolis sampler based on the Laplace approximation. Using a te proposal 
yielded equally poor results. See the technical supplement for full details. 

Data set 







Synthetic 


Polls 


Xerop 


ESS 


Polya-Gamma 


6976 


9194 


3039 




Metropolis 


3675 


53 


3 


ESR 


Polya-Gamma 


957 


288 


311 




Metropolis 


929 


0.36 


0.01 



credit data set (labeled AC in the table), the Gaussian proposal led to a median effective 
sampling rate of 122 samples per second. The very similar multivariate tQ proposal led to 
far more rejected proposals, and gave an effective sampling rate of only 2.6 samples per 
second. Diagnosing such differences for a specific problem may cost the user more time 
than is saved by a slightly faster sampler. 

Finally, the Polya-Gamma method truly shines when the model has a complex prior 
structure. In general, it is very difficult to design good Metropolis samplers for these 
problems. For example, consider a binary logit mixed model with grouped data and a 
random-effects structure, where the log-odds of success for observation j in group % are 
ipij = Oii + Xijfy, and where either the «j, the ft, or both receive further hyperpriors. 
It is not clear that a good default Metropolis sampler is easily constructed unless there 
are a large number of observations per group. Table [2] shows the results of naively using 
an independence Metropolis sampler based on the Laplace approximation to the full joint 
posterior. For a synthetic data set with a balanced design of 100 observations per group, the 
Polya-Gamma method is slightly better. For the two real data sets with highly unbalanced 
designs, it is much better. 

Of course, it is certainly possible to design and tune better Metropolis-Hastings samplers 
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Table 3: Summary of experiments on simulated data for negative-binomial models. Metropolis: the result 
of an independence Metropolis sampler based on a ta proposal. FS09: the algorithm of Fmhwirth-Sch natter| 



et al. (20091. Siml and Sim2: simulated negative-binomial regression problems. GP1 and GP2: simulated 



Gaussian-process spatial models. The independence Metropolis algorithm is not applicable in the spatial 
models, where there as many parameters as observations. 









Data set 








Regl 


Reg2 


GP1 


GP2 




Total Counts 


3244 


9593 


9137 


22732 


ESS 


Polya-Gamma 


7646 


3590 


6309 


6386 




FS09 


719 


915 


1296 


1157 




Metropolis 


749 


764 






ESR 


Polya-Gamma 


285 


52 


62 


3.16 




FS09 


86 


110 


24 


0.62 




Metropolis 


73 


87 







for mixed models; see, for example, Gamerman (1997). We simply point out that what 



works well in the simplest case need not work well in a slightly more complicated case. The 
advantages of the Polya-Gamma method are that it requires no tuning, is very simple to 
implement, and gives optimal or near-optimal performance across a range of cases. 

In negative-binomial models. The Polya-Gamma method consistently yields the best 
effective sample sizes in negative-binomial regression. However, its effective sampling rate 
suffers when working with a large numbers of counts or a non-integral over-dispersion pa- 
rameter. Currently, our Polya-Gamma sampler can draw from PG(6, ip) quickly when b = 1, 
but not for general, integral b: to sample from PG(6, ip) when b £ N, we take b independent 
samples of PG(1, ip) and sum them. Thus in negative-binomial models, one must sample at 
least YliLi m Polya-Gamma random variates, where %ji is the ith response, at every MCMC 
iteration. When the number of counts is relatively high, this becomes a burden. 

The columns labeled Regl and Reg2 of Table [3] show results for data simulated from 
a negative-binomial model with 400 observations and 3 regressors. (See the technical sup- 
plement for details.) In the first case (Regl), the intercept is chosen so that the average 
outcome is a count of 8 (3244 total counts). Given the small average count size, the Polya- 
Gamma method has a superior effective sampling rate compared to the approximate method 



of Fruhwirth-Schnatter et al. (2009), the next-best choice. In the second case (Reg2), the 



average outcome is a count of 24 (9593 total counts). Here the Fruhwirth-Schnatter et al. 
algorithm finishes more quickly, and therefore has a better effective sampling rate. In both 
cases we restrict the sampler to integer over-dispersion parameters. 

As before, the Polya-Gamma method starts to shine when working with more compli- 
cated hierarchical models that devote proportionally less time to sampling the auxiliary 
variables. For instance, consider a spatial model where we observe counts yi,...,y n at lo- 
cations xx,...,x n , respectively. It is natural to model the log rate parameter as a Gaussian 
process: 

yi~NB(n,l/{l + e-*<}), tP~GP{0,K), 
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where ip = (ipi, . . . ,')pn) T and K is constructed by evaluating a covariance kernel at the 
locations x%. For example, under the squared-exponential kernel, we have 



K ij = k + exp 



d {x% , Xj ) 

2£ 2 



with characteristic length scale i, nugget k, and distance function d (in our examples, 
Euclidean distance). 



Using either the Polya-Gamma or the Friihwirth-Schnatter et al. (2009) techniques, 



one arrives at a multivariate Gaussian conditional for ip whose covariance matrix involves 
latent variables. Producing a random variate from this distribution is expensive, as one 
must calculate the Cholesky decomposition of a relatively large matrix at each iteration. 
Therefore, the overall sampler spends relatively less time drawing auxiliary variables. Since 
the Polya-Gamma method leads to a higher effective sample size, it wastes fewer of the 
expensive draws for the main parameter. 

The columns labeled GP1 and GP2 of Table [3] show two such examples. In the first 
synthetic data set, 256 equally spaced x points were used to generate a draw for -0 from a 
Gaussian process with length scale i = 0.1 and nugget = 0.0. The average count y = 35.7, 
yielding 9137 total counts (roughly the same as in the second regression example, Sim2). In 
the second synthetic data set, we simulated if] from a Gaussian process over 1000 x points, 
with length scale £ = 0.1 and a nugget = 0.0001. This yielded 22,720 total counts. In both 
cases, the Polya-Gamma method led to a more efficient sampler — by a factor of 3 for the 
smaller problem, and 5 for the larger. 

6 Discussion 

We have shown that Bayesian inference for logistic models can be implemented using a 
data augmentation scheme based on the novel class of Polya-Gamma distributions. This 
leads to simple Gibbs-sampling algorithms for posterior computation that exploit standard 
normal linear-model theory, and that are notably simpler than previous schemes. We have 
also constructed an accept /reject sampler for the new family, with strong guarantees of 
efficiency (Propositions 2 and 3). 

The evidence suggests that our data-augmentation scheme is the best current method 
for fitting complex Bayesian hierarchical models with binomial likelihoods. It also opens 
the door for exact Bayesian treatments of many modern-day machine-learning classification 



methods based on mixtures of logits (e.g. Salakhutdinov et al. 2007 Blei and Lafferty 



2007). Applying the Polya-Gamma mixture framework to such problems is currently an 
active area of research. 

Moreover, posterior updating via exponential tilting is a quite general situation that 
arises in Bayesian inference incorporating latent variables. In our case, the posterior dis- 
tribution of uj that arises under normal pseudo-data with precision uj and a PG(6, 0) prior 
is precisely an exponentially titled PG(6, 0) random variable. This led to our characteriza- 
tion of the general PG(6, c) class. Notice, moreover, that one may identify the conditional 
posterior for ujij strictly using its moment-generating function, without ever appealing to 
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Bayes' rule for density functions. This follows the Levy-penalty framework of Poison and 



Scott (2012) and relates to work by Ciesielski and Taylor (1962) on the sojourn times of 



Brownian motion. There may be many other modeling situations where the basic idea is 
also applicable. 

Our benchmarks have relied upon serial computation. However, one may trivially par- 
allelize a vectorized Polya-Gamma draw on a multicore CPU. Devising such a sampler for a 
graphical-processing unit (GPU) is less straightforward, but potentially more fruitful. The 
massively parallel nature of GPUs offer a solution to the sluggishness found when sampling 
PG(n, z) variables for large, integral n, which was the largest source of inefficiency with the 
negative-binomial results presented earlier. 
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Technical Supplement 



SI Details of Polya- Gamma sampling algorithm 



Algorithm [T] shows pseudo-code for sampling the P61ya-Gamma(l, z) distribution. Recall 
from the main manuscript that one may pick t > and define the piecewise coefficients 



a n {x) 



yr(n + l/2) 



TTX 



3/2 



exp 



2{n+l/2f 



, (n + l/2) 2 vr 2 
vr(n + 1/2) exp { -± ^ x 



< x < t, 
x > t, 



(15) 
(16) 



so that f(x) = Y^=o(~^) n a n{x) satisfies the partial sum criterion for x > 0. 

To complete the analysis of the Polya-Gamma sampler, we specify our method for 
sampling truncated inverse Gaussian random variables, IG(l/z, l)!^]- When z is small 
the inverse Gaussian distribution is approximately inverse Xii motivating an accept-reject 
algorithm. When z is large, most of the inverse Gaussian distribution's mass will be below 
the truncation point t, motivating a rejection algorithm. Thus, we take a two pronged 
approach. 

When \ jz > t we generate a truncated inverse-Gaussian random variate using accept- 
reject sampling using the proposal distribution (l/xi)I(t,oo)- The proposal X is generated 
following Devroye (2009). Considering the ratio of the kernels, one finds that P(accept|X = 
x) = exp(— xz' 2 /2). Since z < 1/t and X < t we may compute a lower bound on the average 
rate of acceptance: 



E 



exp 



-X 



> exp 



-1 
2t 



0.61 



See algorithm ^ for pseudocode. 

When 1/z < t, we generate a truncated inverse-Gaussian random variate using rejec- 
tion sampling. |Devroye" ( 1986 ) (p. 149) describes how to sample from an inverse-Gaussian 
distribution using a many-to-one transformation. Sampling X in this fashion until X < t 
yields an acceptance rate bounded below by 



I IG(x\l/z,\ = l)dx> [ IG(x\t, 
Jo Jo 



0.67 



for all 1/z < t. See Algorithm [3] for pseudocode. 

Recall that when b is an integer, we draw PG(6, z) by summing b i.i.d. draws from 
PG(l,z). When b is not integral, write b = [b\ + e, where [b\ is the integral part of b, 
and sum a draw from PG([&j , z), using the method previously described, with a draw from 
PG(e, z), using the finite sum-of-gammas approximation. With 200 terms in the sum, we 
find that the approximation is quite accurate for such small values of the first parameter, as 
each Ga(e, 1) term in the sum tends to be small, and the weights in the sum decay like 1/k 2 . 
This, in contrast, may not be the case when using the finite sum-of-gammas approximation 
for arbitrary b. The general, approximate sampler is only necessary for negative binomial 
regression. 
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Algorithm 1 Sampling from PG(l,z) 
Input: z, a positive real number 

Define: pigauss(i | [i, A), the CDF of the inverse Gaussian distribution 
Define: a n (x), the piecewise-defined coefficients in (15) and (16). 
z <- \z\/2, t <- 0.64, K <- vr 2 /8 + z 2 /2 
P <~ exp(-Kt) 

g <— 2 exp(— |z|) pigauss(i | /x = 1/z, A = 1.0) 
repeat 

Generate U, V ~ W(0, 1) 
if £7 < p/(p + g) then 

(Truncated Exponential) 
X + £/K where £ ~ £(1) 
else 

(Truncated Inverse Gaussian) 
H <— l/z 
if /x > t then 
repeat 

Generate l/X ~ x 2 l (ti0o) 

until W(0,1) < exp(-^X) 
else 

repeat 

Generate X ~ ZA/"(/x, 1.0) 
until X < i 
end if 
end if 

S ^ a (X), Y ^VS,n^0 
repeat 

n <— n + 1 

if n is odd then 

S <-S- a n (X); if Y < S, then return X / 4 

else 

5 <- 5 + a n (X); if Y > S, then break 
end if 
until FALSE 
until FALSE 



Algorithm 2 Algorithm used to generate IG(fi = l/z, A = 1)1 (o,t) when /x > t. 

Truncation point: t. z = 1/fi. 
repeat 

repeat 

Generate E,E' ~ £(1). 

until £ 2 < 

A <- t/(l + tE) 2 

a <— exp(^z 2 X) 

U ~ W 
until U < a 
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Algorithm 3 Algorithm used to generate IG(fi = l/z, A = 1)1(q t > when \i < t. 

Truncation point: t. z = 
repeat 

Y ~ N(0,1) 2 . 

X <- \i + 0.5fi 2 Y - O.bfi^AfiY + (JjYf 

u ~u 

If (U > n/(p, + X)), then X <- ^/X. 
until X < i?. 



Density of PG(b,0) 



Density of PG(1,c) 
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Figure 3: Plots of the density of the Polya-Gamma distribution PG(6, c) for various values 
of b and c. Note that the horizontal and vertical axes differ in each plot. 

S2 Benchmarks: overview 



We benchmark the Polya-Gamma method against several alternatives for binary logistic 
regression and negative binomial regression for count data to measure its relative perfor- 
mance. All of these benchmarks are empirical and hence some caution is urged. Our 
primary metric of comparison is the effective sampling rate, which is the effective sam- 
ple size per second and which quantifies how quickly a sampler can produce independent 
draws from the posterior distribution. However, this metric is sensitive to numerous id- 
iosyncrasies relating to the implementation of the routines, the language in which they are 
written, and the hardware on which they are run. We generate these benchmarks using R, 
though some of the routines make calls to external C code. The specifics of each method 
are discussed in further detail below. In general, we find that the Polya-Gamma technique 
compares favorably to other data augmentation methods. Specifically, the Polya-Gamma 



technique performs better than the methods of 


O'Brien and Dunson 


(2004 


), 


Gramacy and 


Poison 


(2012b 


), and 


Friihwirth-Schnatter and Friihwirth 


(2010 


)• 


Friihwirth-Schnatter and 



Friihwirth (2010) provides a detailed comparison of several methods itself. For instance, the 



authors find that method of Holmes and Held ( 2006 ) did not beat their discrete mixture of 



normals. We find this as well and hence omit it from the comparisons below. 

For each data set, we run 10 MCMC simulations with 12,000 samples each, discarding 
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the first 2,000 as burn-in, thereby leaving 10 batches of 10,000 samples. The effective 



sample size for each regression coefficient is calculated using the coda (Plummer et al. 



2006) package and averaged across the 10 batches. The component-wise minimum, median, 
and maximum of the (average) effective sample sizes are reported to summarize the results. 
A similar calculation is performed to calculate minimum, median, and maximum effective 
sampling rates (ESR). The effective sampling rate is the ratio of of effective sample size to 
the time taken to produce the sample. Thus, the effective sampling rates are normalized by 
the time taken to produce the 10,000 samples, disregarding the time taken for initialization, 
preprocessing, and burn-in. When discussing the various methods the primary metric we 



refer to is the median effective sampling rate, following the example of Friihwirth-Schnatter 



and Fruhwirth| ( |2010[ ). 

All of these experiments are carried out using R 2.15.1 on an Ubuntu machine with 8GB 
or RAM and an Intel Core i5 quad core processor. The number of cores is a potentially 
important factor as some libraries, including those that perform the matrix operations in 
R, may take advantage of multiple cores. The C code that we have written does not use 
parallelism. 

In the sections that follow, each table reports the following metrics: 

• the execution time of each method in seconds; 

• the acceptance rate (relevant for the Metropolis samplers); 

• the minimum, median, and maximum effective sample sizes (ESS) across all fixed or 
random effects; and 

• the minimum, median, and maximum effective sampling rates (ESR) across all fixed 
or random effects, defined as the effective sample size per second of runtime. 

S3 Benchmarks: binary logistic regression 
S3.1 Data Sets 



Nodal: part of the boot R package (Canty and Ripley, 2012). The response indicates 



if cancer has spread from the prostate to surrounding lymph nodes. There are 53 
observations and 5 binary predictors. 

Pima Indian: There are 768 observations and 8 continuous predictors. It is noted on the 
UCI website^ that there are many predictor values coded as 0, though the physical 
measurement should be non-zero. We have removed all of those entries to generate a 
data set with 392 observations. The marginal mean incidence of diabetes is roughly 
0.33 before and after removing these data points. 

Heart: The response represents either an absence or presence of heart disease J^] There 
are 270 observations and 13 attributes, of which 6 are categorical or binary and 1 is 
ordinal. The ordinal covariate has been stratified by dummy variables. 

Australian Credit: The response represents either accepting or rejecting a credit card 
application]^] The meaning of each predictor was removed to protect the propriety 
of the original data. There are 690 observations and 14 attributes, of which 8 are 
categorical or binary. There were 37 observations with missing attribute values. These 
missing values were replaced by the mode of the attribute in the case of categorical 
data and the mean of the attribute for continuous data. This dataset is linearly 



1 


http : //archive 


ics . uci . edu/ml/datasets/Pima+Indians+Diabetes 






http : / /archive 


ics .uci . edu/ml/datasets/Statlog+ (Heart) 




:• 


http : / /archive 


ics .uci . edu/ml/datasets/Statlog+(Australian+Credit+Approval) 
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separable and results in some divergent regression coefficients, which are kept in check 
by the prior. 

German Credit 1 and 2: The response represents either a good or bad credit riskj^] 
There are 1000 observations and 20 attributes, including both continuous and cat- 
egorical data. We benchmark two scenarios. In the first, the ordinal covariates have 
been given integer values and have not been stratified by dummy variables, yielding 
a total of 24 numeric predictors. In the second, the ordinal data has been stratified 
by dummy variables, yielding a total of 48 predictors. 

Synthetic 1: Simulated data with 150 outcomes and 10 predictors. The design points were 
chosen to be orthogonal. The data are included as a supplemental file. 

Synthetic 2: Simulated data with 500 outcomes and 20 predictors. The design points were 
simulated from a Gaussian factor model, to yield pronounced patterns of collinearity. 
The data are included as a supplemental file. 



S3.2 Methods 

All of these routines are implemented in R, though some of them make calls to C. In 
particular, the independence Metropolis samplers do not make use of any non-standard 
calls to C, though their implementations have very little R overhead in terms of function 
calls. The Polya-Gamma method calls a C routine to sample the Polya-Gamma random 
variates, but otherwise only uses R. 

As a check upon our independence Metropolis sampler we include the independence 



Metropolis sampler of Rossi et al. (2005a), which may be found in the bayesm package 



(Rossi 2012 ). Their sampler uses a te proposal, while ours uses a normal proposal. The suite 



of routines in the binomlogit package (Fussl 2012) implement the techniques discussed in 



Fussl et al. (2011). One routine provided by the binomlogit package coincides with the 



technique described in Friihwirth-Schnatter and Friihwirth (2010) for the case of binary 
logistic regression. A separate routine implements the latter and uses a single call to C. 



Gramacy and Poison's R package, reglogit, also calls external C code (Gramacy, 2012) 



For every data set the regression coefficient was given a diffuse N(0, 0.01/) prior, except 
when using Gramacy and Poison's method, in which case it was given a exp(^ i |/3j/100|) 
prior per the specifications of the reglogit package. The following is a short description 
of each method along with its abbreviated name. 
PG: The Polya-Gamma method described previously. 



FS: Friihwirth-Schnatter and Friihwirth (2010) follow Holmes and Held (2006) and use the 



representation 



l{zi > 0} 



Xi/3 + €{ 



Lo , 



(17) 



where Lo is the standard logistic distribution (c.f. Albert and Chib, 1993a, for the 



probit case). They approximate p(ej) using a discrete mixture of normals. 
IndMH: Independence Metropolis with a normal proposal using the posterior mode and 

the Hessian at the mode for the mean and precision matrix. 
RAM: after Rossi, Allenby, and McCulloch. An independence Metropolis with a te proposal 

from the R package bayesm (Rossi, 2012). Calculate the posterior mode and the 



Hessian at the mode to pick the mean and scale matrix of the proposal. 
OD: The method of O'Brien and Dunson (2004). Strictly speaking, this is not logistic 



regression; it is binary regression using a Student-i cumulative distribution function 



http : //archive . ics .uci . edu/ml/datasets/Statlog+(German+Credit+Data) 
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Table 4: Nodal data: N = 53, P = 6 



Method 


time 


ARate 


ESS. min 


ESS.med 


ESS. max 


ESR.min 


ESR.med 


ESR.max 


PG 


2.98 


1.00 


3221.12 


4859.89 


5571.76 


1081.55 


1631.96 


1871.00 


IndMH 


1.76 


0.66 


1070.23 


1401.89 


1799.02 


610.19 


794.93 


1024.56 


RAM 


1.29 


0.64 


3127.79 


3609.31 


3993.75 


2422.49 


2794.69 


3090.05 


OD 


3.95 


1.00 


975.36 


1644.66 


1868.93 


246.58 


415.80 


472.48 


FS 


3.49 


1.00 


979.56 


1575.06 


1902.24 


280.38 


450.67 


544.38 


dRUMAuxMix 


2.69 


1.00 


1015.18 


1613.45 


1912.78 


376.98 


598.94 


710.30 


dRUMIndMH 


1.41 


0.62 


693.34 


1058.95 


1330.14 


492.45 


751.28 


943.66 


IndivdRUMIndMH 


1.30 


0.61 


671.76 


1148.61 


1339.58 


518.79 


886.78 


1034.49 


dRUMHAM 


3.06 


1.00 


968.41 


1563.88 


1903.00 


316.82 


511.63 


622.75 


GP 


17.86 


1.00 


2821.49 


4419.37 


5395.29 


157.93 


247.38 


302.00 



as the inverse link function. 



dRUMAuxMix: Work by Fussl et al. (2011) that extends the technique of Friihwirth- 



Schnatter and Friihwirth (2010). A convenient representation is found that relies on 



a discrete mixture of normals approximation for posterior inference that works for 



binomial logistic regression. From the R package binomlogit (Fussl 2012). 



dRUMIndMH: Similar to dRUMAuxMix, but instead of using a discrete mixture of nor- 
mals, use a single normal to approximate the error term and correct using Metropolis- 
Hastings. From the R package binomlogit. 

IndivdRUMIndMH: This is the same as dRUMIndMH, but specific to binary logistic re- 
gression. From the R package binomlogit. 

dRUMHAM: Identical to dRUMAuxMix, but now use a discrete mixture of normals ap- 
proximation in which the number of components to mix over is determined by yi/rii. 
From the R package binomlogit. 



GP: after Gramacy and Poison (2012a). Another data augmentation scheme with only a 



single layer of latents. This routine uses a double exponential prior, which is hard- 



coded in the R package reglogit (Gramacy, 2012). We set the scale of this prior to 



agree with the scale of the normal prior we used in all other cases above. 



S3.3 Results 



The results are shown in Tables [4] through 11 As mentioned previously, these are averaged 
over 10 runs. 



S4 Benchmarks: logit mixed models 

A major advantage of data augmentation, and hence the Polya-Gamma technique, is that it 
is easily adapted to more complicated models. We consider three examples of logistic mixed 
model whose intercepts are random effects, in which case the log odds for observation j from 
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Table 5: Diabetes data, N=270, P=19 



Method 


time 


ARate 


ESS. min 


ESS.med 


ESS. max 


ESR.min 


ESR.med 


ESR.max 


PG 


5.65 


1.00 


3255.25 


5444.79 


6437.16 


576.14 


963.65 


1139.24 


IndMH 


2.21 


0.81 


3890.09 


5245.16 


5672.83 


1759.54 


2371.27 


2562.59 


RAM 


1.93 


0.68 


4751.95 


4881.63 


5072.02 


2456.33 


2523.85 


2621.98 


OD 


6.63 


1.00 


1188.00 


2070.56 


2541.70 


179.27 


312.39 


383.49 


FS 


6.61 


1.00 


1087.40 


1969.22 


2428.81 


164.39 


297.72 


367.18 


dRUMAuxMix 


6.05 


1.00 


1158.42 


1998.06 


2445.66 


191.52 


330.39 


404.34 


dRUMIndMH 


3.82 


0.49 


647.20 


1138.03 


1338.73 


169.41 


297.98 


350.43 


IndivdRUMIndMH 


2.91 


0.48 


614.57 


1111.60 


1281.51 


211.33 


382.23 


440.63 


dRUMHAM 


6.98 


1.00 


1101.71 


1953.60 


2366.54 


157.89 


280.01 


339.18 


GP 


88.11 


1.00 


2926.17 


5075.60 


5847.59 


33.21 


57.61 


66.37 



Table 6: Heart data: N = 270, P = 19 



Method 


time 


ARate 


ESS.min 


ESS.med 


ESS.max 


ESR.min 


ESR.med 


ESR.max 


PG 


5.56 


1.00 


2097.03 


3526.82 


4852.37 


377.08 


633.92 


872.30 


IndMH 


2.24 


0.39 


589.64 


744.86 


920.85 


263.63 


333.19 


413.03 


RAM 


1.98 


0.30 


862.60 


1076.04 


1275.22 


436.51 


543.95 


645.13 


OD 


6.68 


1.00 


620.90 


1094.27 


1596.40 


93.03 


163.91 


239.12 


FS 


6.50 


1.00 


558.95 


1112.53 


1573.88 


85.92 


171.04 


241.96 


dRUMAuxMix 


5.97 


1.00 


604.60 


1118.89 


1523.84 


101.33 


187.49 


255.38 


dRUMIndMH 


3.51 


0.34 


256.85 


445.87 


653.13 


73.24 


127.28 


186.38 


IndivdRUMIndMH 


2.88 


0.35 


290.41 


467.93 


607.80 


100.70 


162.25 


210.79 


dRUMHAM 


7.06 


1.00 


592.63 


1133.59 


1518.72 


83.99 


160.72 


215.25 


GP 


65.53 


1.00 


1398.43 


2807.09 


4287.55 


21.34 


42.84 


65.43 



Table 7: Australian Credit: N = 690, P = 35 



Method 


time 


ARate 


ESS.min 


ESS.med 


ESS.max 


ESR.min 


ESR.med 


ESR.max 


PG 


12.78 


1.00 


409.98 


3841.02 


5235.53 


32.07 


300.44 


409.48 


IndMH 


3.42 


0.22 


211.48 


414.87 


480.02 


61.89 


121.53 


140.59 


RAM 


3.92 


0.00 


8.27 


10.08 


26.95 


2.11 


2.57 


6.87 


OD 


14.59 


1.00 


28.59 


988.30 


1784.77 


1.96 


67.73 


122.33 


FS 


15.05 


1.00 


36.22 


1043.69 


1768.47 


2.41 


69.37 


117.53 


dRUMAuxMix 


14.92 


1.00 


29.34 


991.32 


1764.40 


1.97 


66.44 


118.27 


dRUMIndMH 


8.93 


0.19 


13.03 


222.92 


435.42 


1.46 


24.97 


48.76 


IndivdRUMIndMH 


7.38 


0.19 


13.61 


220.02 


448.76 


1.85 


29.83 


60.84 


dRUMHAM 


18.64 


1.00 


28.75 


1040.74 


1817.85 


1.54 


55.84 


97.53 


GP 


162.73 


1.00 


95.81 


2632.74 


4757.04 


0.59 


16.18 


29.23 
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Table 8: German Credit 1: N = 1000, P = 25 



Method 


time 


ARate 


ESS.min 


ESS.med 


ESS. max 


ESR.min 


ESR.med 


ESR.max 


PG 


15.37 


1.00 


3111.71 


5893.15 


6462.36 


202.45 


383.40 


420.44 


IndMH 


3.58 


0.68 


2332.25 


3340.54 


3850.71 


651.41 


932.96 


1075.47 


RAM 


4.17 


0.43 


1906.23 


2348.20 


2478.68 


457.11 


563.07 


594.30 


OD 


17.32 


1.00 


1030.53 


2226.92 


2637.98 


59.51 


128.59 


152.33 


FS 


18.21 


1.00 


957.05 


2154.06 


2503.09 


52.55 


118.27 


137.43 


dRUMAuxMix 


18.13 


1.00 


955.41 


2150.59 


2533.40 


52.68 


118.60 


139.70 


dRUMIndMH 


10.60 


0.29 


360.72 


702.89 


809.20 


34.03 


66.30 


76.33 


IndivdRUMIndMH 


8.35 


0.29 


334.83 


693.41 


802.33 


40.09 


83.04 


96.08 


dRUMHAM 


22.15 


1.00 


958.02 


2137.13 


2477.10 


43.25 


96.48 


111.84 


GP 


223.80 


1.00 


2588.07 


5317.57 


6059.81 


11.56 


23.76 


27.08 



Table 9: German Credit 2: N = 1000, P = 49 



Method 


time 


ARate 


ESS.min 


ESS.med 


ESS.max 


ESR.min 


ESR.med 


ESR.max 


PG 


22.30 


1.00 


2803.23 


5748.30 


6774.82 


125.69 


257.75 


303.76 


IndMH 


4.72 


0.41 


730.34 


1050.29 


1236.55 


154.73 


222.70 


262.05 


RAM 


6.02 


0.00 


5.49 


14.40 


235.50 


0.91 


2.39 


39.13 


OD 


25.34 


1.00 


717.94 


2153.05 


2655.86 


28.33 


84.96 


104.80 


FS 


26.44 


1.00 


727.17 


2083.48 


2554.62 


27.50 


78.80 


96.62 


dRUMAuxMix 


26.91 


1.00 


755.31 


2093.68 


2562.11 


28.06 


77.80 


95.21 


dRUMIndMH 


14.66 


0.13 


132.74 


291.11 


345.12 


9.05 


19.86 


23.54 


IndivdRUMIndMH 


12.45 


0.13 


136.57 


290.13 


345.22 


10.97 


23.31 


27.73 


dRUMHAM 


35.99 


1.00 


742.04 


2075.41 


2579.42 


20.62 


57.67 


71.67 


GP 


243.41 


1.00 


2181.84 


5353.41 


6315.71 


8.96 


21.99 


25.95 



Table 10: Synthetic 1, orthogonal predictors: iV = 150,P = 10 



Method 


time 


ARate 


ESS.min 


ESS.med 


ESS.max 


ESR.min 


ESR.med 


ESR.max 


PG 


3.83 


1.00 


6140.81 


7692.04 


8425.59 


1604.93 


2010.44 


2201.04 


FS 


4.46 


1.00 


2162.42 


2891.85 


3359.98 


484.91 


648.41 


753.38 


IndMH 


1.87 


0.78 


3009.10 


4114.86 


4489.16 


1609.67 


2200.72 


2397.94 


RAM 


1.54 


0.64 


3969.87 


4403.51 


4554.04 


2579.84 


2862.12 


2960.05 


OD 


4.88 


1.00 


2325.65 


3030.71 


3590.09 


476.36 


620.74 


735.29 


dRUMIndMH 


2.10 


0.53 


1418.07 


1791.71 


2030.70 


676.70 


854.94 


968.96 


dRUMHAM 


4.34 


1.00 


2170.71 


2887.57 


3364.68 


500.67 


666.18 


776.37 


dRUMAuxMix 


3.79 


1.00 


2207.30 


2932.21 


3318.37 


583.11 


774.58 


876.59 


IndivdRUMIndMH 


1.72 


0.53 


1386.35 


1793.50 


2022.31 


805.40 


1042.20 


1174.97 


GP 


38.53 


1.00 


5581.31 


7284.98 


8257.91 


144.85 


189.07 


214.32 
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Table 11: Synthetic 2, correlated predictors: N = 500, P = 20 



Method 


time 


ARate 


ESS.min 


ESS.med 


ESS.max 


ESR.min 


ESR.med 


ESR.max 


PG 


8.70 


1.00 


1971.61 


2612.10 


2837.41 


226.46 


300.10 


325.95 


FS 


9.85 


1.00 


459.59 


585.91 


651.05 


46.65 


59.48 


66.09 


IndMH 


2.52 


0.42 


826.94 


966.95 


1119.81 


327.98 


382.96 


443.65 


RAM 


2.59 


0.34 


1312.67 


1387.94 


1520.29 


507.54 


536.84 


588.10 


OD 


9.67 


1.00 


428.12 


573.75 


652.30 


44.28 


59.36 


67.48 


dRUMIndMH 


5.35 


0.33 


211.14 


249.33 


281.50 


39.46 


46.58 


52.59 


dRUMHAM 


11.18 


1.00 


452.50 


563.30 


644.73 


40.46 


50.37 


57.65 


dRUMAuxMix 


9.51 


1.00 


422.00 


564.95 


639.89 


44.39 


59.43 


67.31 


IndivdRUMIndMH 


4.17 


0.32 


201.50 


239.50 


280.35 


48.37 


57.51 


67.30 


GP 


114.98 


1.00 


748.71 


1102.59 


1386.08 


6.51 


9.59 


12.06 



group i, ipij, is modeled by 

tpij = cx-i + x i:j j3 

cti ~ N(m,l/(j)) 

m ~ iV(0,K 2 /</>) 

cf) ~ Go(l,l) 

P ~ #(0,1007). (18) 

An extra step is easily added to the Polya-Gamma Gibbs sampler to estimate (a, f3, m) 
and <p. We use the following three data sets to benchmark the Polya-Gamma method. 

Synthetic: A synthetically generated dataset with 5 groups, 100 observations within each 
group, and a single fixed effect. 



Polls: Voting data from a Presidential campaign (Gelman and Hill, 2006). The response 
indicates a vote for or against former President George W. Bush. There are 49 groups 
corresponding to states. Some states have very few observations, requiring a model 
that shrinks coefficients towards a global mean to get reasonable estimates. A single 
fixed effect for the race of the respondent is included, although it would be trivial to 
include other covariates. Entries with missing data were deleted to yield a total of 
2015 observations. 



Xerop: The Xerop data set from the epicalc R package ( Chongsuvivatwong 2012). In- 
donesian children were observed to examine the causes of respiratory infections; of 
specific interest is whether vitamin A deficiencies cause such illness. Multiple obser- 
vations of each individual were made. The data is grouped by individual id yielding a 
total of 275 random intercepts. A total of 5 fixed effects are included in the model — 
age, sex, height, stunted growth, and season — corresponding to an 8 dimensional 
regression coefficient after expanding the season covariate using dummy variables. 



Table 12 summarizes the results, which suggest that the Polya-Gamma method is a 
sensible default choice for fitting nonlinear mixed-effect models. 

While an independence Metropolis sampler usually works well for binary logistic regres- 
sion, it does not work well for the mixed models we consider. For instance, in the polls data 
set, at least two heuristics that suggest the Laplace approximation will be a poor proposal. 
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Synthetic: N = 500, P a = 5, P b = 1, samp=10,000, burn=2,000, thin=l 



Method 


time 


ARate 


ESS.min 


ESS.med 


ESS. max ESR.min ESR.med 


ESR.max 


PG 


7.29 


1.00 


4289.29 


6975.73 


9651.69 588.55 957.18 


1324.31 


Ind-Met. 


3.96 


0.70 


1904.71 


3675.02 


4043.42 482.54 928.65 


1022.38 




Polls: 


N = 2015 


, P a = 49, Pb 


= 1, samp= 


100,000, burn=20,000, thin=10 




Method 


time 


ARate 


ESS.min 


ESS.med 


ESS. max ESR.min ESR.med 


ESR.max 


PG 


31.94 


1.00 


5948.62 


9194.42 


9925.73 186.25 287.86 


310.75 


Ind-Met. 


146.76 


0.00674 


31.36 


52.81 


86.54 0.21 0.36 


0.59 




Xerop: 


N = 1200 


, P a — 275, Pb = 8, samp 


=100,000, burn=20,000, thin=10 




Method 


time 


ARate 


ESS.min 


ESS.med 


ESS. max ESR.min ESR.med 


ESR.max 


PG 


174.38 


1.00 


850.34 


3038.76 


4438.99 4.88 17.43 


25.46 


Ind-Met. 


457.86 


0.00002.5 


1.85 


3.21 


12.32 0.00 0.01 


0.03 



Table 12: A set of three benchmarks for binary logistic mixed models. iV denotes the 
number of samples, P a denotes the number of groups, and Pb denotes the dimension of 
the fixed effects coefficient. The random effects are limited to group dependent intercepts. 
Notice that the second and third benchmarks are thinned every 10 samples to produce a 
total of 10,000 posterior draws. Even after thinning, the effective sample size for each is 
low compared to the PG method. The effective samples sizes are taken for the collection 
(a, (3, m) and do not include 4>. 



First, the posterior mode does not coincide with the posterior mean. Second, the Hessian 
at the mode is nearly singular. Its smallest eigenvalue, in absolute terms, corresponds to 
an eigenvector that points predominantly in the direction of <p. Thus, there is a great deal 
of uncertainty in the posterior mode of <fi. If we iteratively solve for the MLE by starting 
at the posterior mean, or if we start at the posterior mode for all the coordinates except </>, 
which we initialize at the posterior mean of <fi, then we arrive at the same end point. This 
suggests that the behavior we observe is not due to a poor choice of initial value or a poor 
stopping rule. 

The first image in Figure S4 shows that the difference between the posterior mode and 



posterior mean is, by far, greatest in the <j) coordinate. The second image in Figure S4 



provides one example of the lack of curvature in <fi at the mode. If one plots <p against the 
other coordinates, then one sees a similar, though often less extreme, picture. In general, 
large values of <j> are found at the tip of an isosceles triangular whose base runs parallel to 
the coordinate that is not (p. While the upper tip of the triangle may posses the most likely 
posterior values, the rest of the posterior does not fall away quick enough to make that a 
likely posterior random variate. 



S5 Benchmarks: negative-binomial models 

We simulated two synthetic data sets with ./V = 400 data points using the model 

y-i ~ NB (mean = /ij, d) , log /Xj = a + Xij3 

where /3 E M 3 . Both data sets are included as supplements. The parameter d is estimated 
using a random-walk Metropolis-Hastings step over the integers. (Neither the Polya-Gamma 
method nor the R package by Fussl (2012) are set up to work efficiently with non-integer 
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Posterior mode and mean 



Log posterior at mode with variations in m, phi 




Figure 4: Proceeding from left to right and top to bottom. Upper left: the posterior mode 
and the posterior mean of (a, j3, m, <f>). The mode and mean are most different in (p. Upper 
right: the level sets of ((f), m) of the log posterior when the other coordinates are evaluated 
at the posterior mode. The log posterior is very flat when moving along (f>. Bottom left: the 
marginal posterior distribution of (f>. When marginalizing, one finds that few large values 
of are likely. Bottom right: a scatter plot of posterior samples for ((f), m). Again, one 
sees that upon marginalizing out the other coordinates the posterior mass is concentrated 
at relatively small values of (ft compared to its value at the posterior mode. 
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Fewer counts: a = 2, y = 8.11, J2 Vi = 3244, N = 400 



Method time ARate ESS.min ESS.med ESS. max ESR.min ESR.med ESR.max 

PG 26.84 1.00 7269.13 7646.16 8533.51 270.81 284.85 317.91 

FS 8.10 1.00 697.38 719.36 759.13 86.10 88.80 93.70 

RAM 10.17 30.08 737.95 748.51 758.57 72.59 73.62 74.61 

More counts: a = 3, y = 23.98, = 9593, N = 400 

Method time ARate ESS.min ESS.med ESS. max ESR.min ESR.med ESR.max 

PG 58.99 1.00 3088.04 3589.67 4377.21 52.35 60.85 74.20 

FS 8.21 1.00 901.50 915.39 935.06 109.73 111.45 113.84 

RAM 8.69 30.33 757.91 763.81 771.73 87.25 87.93 88.84 



Table 13: Negative binomial regression. PG is the Polya-Gamma Gibbs sampler. FS 
follows Friihwirth-Schnatter et al. (2009). RAM is the random walk Metropolis-Hastings 



sampler from the bayesm package (Rossi 2012). a is the true intercept and yi is the ith 



response. Each model has three continuous predictors. 



Gaussian process 1: y = 35.7, ~}2 yi = 9137, N = 256, I = 0.1, nugget=0.0 



Method 

PG 

FS 



time ARate 
101.89 1.00 
53.17 1.00 



ESS.min 
790.55 
481.36 



ESS.med 
6308.65 
1296.27 



ESS. max 
9798.04 
2257.27 



ESR.min 
7.76 
9.05 



ESR.med 
61.92 
24.38 



ESR.max 
96.19 
42.45 



Gaussian process 2: y = 22.7, J] y % = 22732, N = 1000, I = 0.1, nugget=0.0001 



Method time ARate ESS.min ESS.med ESS. max ESR.min 
PG 2021.78 1.00 1966.77 6386.43 9862.54 0.97 

FS 1867.05 1.00 270.13 1156.52 1761.70 0.14 



ESR.med ESR.max 
3.16 4.88 
0.62 0.94 



Table 14: Binomial spatial models. PG is the Polya-Gamma Gibbs sampler. FS follows 



Friihwirth-Schnatter et al. (2009). N is the total number of observations and yi denotes the 



ith observation. 
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values of this parameter.) The model with fewer counts corresponds to a = 2, while the 
model with more counts corresponds to a = 3. This produced a sample mean of roughly 8 
in the former case and 24 in the latter. 



Table 13 shows the results for both simulated data sets. Notice that the Polya-Gamma 
method has superior effective sample size in both cases, but a lower effective sampling rate 
in the second case. This is caused by the bottleneck of summing n copies of a PG(l,z) 
variable to draw a PG(n, z) variable. As mentioned in the main manuscript, it is an open 
challenge to create an efficient Polya-Gamma sampler for arbitrary n, which would make it 
the best choice in both cases. 

One reaches a different conclusion when working with more complicated models that 
devote proportionally less time to sampling the auxiliary variables. Specifically, consider 
the model 

yi ~ JVB(mean = fi(xi),d) , log/x ~ GP(0,K) , 
where K is the square exponential covariance kernel, 

'\x\ — X2I 19 



K(xi, X2) = k + exp 



2£ 2 

with characteristic length scale I and nugget k. Using either the Polya-Gamma or |Friihwirth- 



Schnatter et al. ( 2009 ) data augmentation techniques, one arrives at a complete conditional 
for v = log/U that is equivalent to the posterior (v\z) derived using pseudo-data {zi} gener- 
ated by 

Zi = V(xi) + €i, 6i ~ iV(0, Vi) 

where Vi is a function of the zth auxiliary variable. Since the prior for v is a Gaussian 
process one may use conjugate formulas to sample the complete conditional of v. But 
producing a random variate from this distribution is expensive as one must calculate the 
Cholesky decomposition of a relatively large matrix at each iteration. Consequently, the 
relative time spent sampling the auxiliary variables in each model decreases, making the 
Polya-Gamma method competitive, and sometimes better, than the method of Friihwirth- 



Schnatter et al. We provide two such examples in Table (14). In the first synthetic data set, 
256 equally spaced points were used to generate a draw v(xi) and yi for i = 1, . . . , 256 where 
v ~ GP(Q, K) and K has length scale i = 0.1 and a nugget = 0.0. The average count value 
of the synthetic data set is y = 35.7, yielding 9137 total counts, which is roughly the same 
amount as in the larger negative binomial example discussed earlier. Now, however, because 
proportionally more time is spent sampling the main parameter, and because the Polya- 
Gamma method wastes fewer of these expensive draws, it is more efficient. In the second 
synthetic data set, 1000 randomly selected points were chosen to generate a draw from v(xi) 
and yi with v ~ GP(0,K) where K has length scale t = 0.1 and a nugget = 0.0001. The 
average count value is y = 22.72, yielding 22,720 total counts. The larger problem shows 
an even greater improvement in performance over the method of Fruhwirth-Schnatter et al. 

S6 Extensions 
S6.1 2 x 2 x iV tables 

Consider a simple example of a binary-response clinical trial conducted in each of N different 
centers. Let riij be the number of patients assigned to treatment regime j in center i; and 
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Table 15: Data from a multi-center, binary-response study on topical cream effectiveness 
(Skene and Wakefieldl p90h . 



Treatment Control 
Center Success Total Success Total 



1 11 36 10 37 

2 16 20 22 32 

3 14 19 7 19 

4 2 16 1 17 

5 6 17 12 

6 1 11 10 

7 15 19 

8 4 6 6 7 



let Y = {i/ij} be the corresponding number of successes for i = 1, . . . , N. Table 1 presents a 
data set along these lines, from Skene and Wakefield (1990). These data arise from a multi- 
center trial comparing the efficacy of two different topical cream preparations, labeled the 
treatment and the control. 

Let pij denote the underlying success probability in center i for treatment j, and ipij 
the corresponding log-odds. If the ipi = {ipn,ipi2) T is assigned a bivariate normal prior 
ipi ~ N(/i, E) then the posterior for ^ = {ipij} is 



N 



p (* i ^) oc n 



i=l 



We apply Theorem 1 from the main paper to each term in the posterior, thereby intro- 
ducing augmentation variables fij = diag(wji, o;^) for each center. This yields, after some 
algebra, a simple Gibbs sampler that iterates between two sets of conditional distributions: 



{A 



{0Jij I Ipij) 



PG(mj,ipij) , 



(19) 



where 



mi 



Qi + 5T 1 

v^(«i + s-V) 

(yn - na/2, y i2 - n i2 /2) T . 



Figure [5] shows the results of applying this Gibbs sampler to the data from Skene and 



Wakefield (1990). 



In this analysis, we used a normal- Wishart prior for (/i, £ ). Hyperparameters were 
chosen to match Table II from Skene and Wakefield (1990), who parameterize the model in 
terms of the prior expected values for p, a'^ , and a^ 2 , where 



S 



a 



0i 
P 



P 

2 

V>2 
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Log-odds ratios in an 8-center binary-response study 



Treatment {95% CI) 
Control (95% CI) 



Treatment Center 



Figure 5: Posterior distributions for the log-odds ratio for each of the 8 centers in the 
topical-cream study from Skene and Wakefield (1990). The vertical lines are central 95% 
posterior credible intervals; the dots are the posterior means; and the X's are the maximum- 
likelihood estimates of the log-odds ratios, with no shrinkage among the treatment centers. 
Note that the maximum-likelihood estimate is Vte = — oo for the control group in centers 5 
and 6, as no successes were observed. 



To match their choices, we use the following identity that codifies a relationship between the 
hyperparameters B and d, and the prior moments for marginal variances and the correlation 
coefficient. If S ~ TW(d, B), then 



B = (d - 3) 



E«) + E«) + 2E(p) y jE(al) E«) E(^) + E(p)^E«) E«) 
E(a* 2 ) + E(pWE(4)E(<) E(^ 2 ) 



In this way we are able to map from pre-specified moments to hyperparameters, ending up 
with d = 4 and 

' 0.754 0.857 
0.857 1.480 



B 



S6.2 Higher-order tables 

Now consider a multi-center, multinomial response study with more than two treatment 
arms. This can be modeled using hierarchy of N different two-way tables, each having the 
same J treatment regimes and K possible outcomes. The data D consist of triply indexed 
outcomes x/ijk , each indicating the number of observations in center i and treatment j with 
outcome k. We let ny = Ylk Vii indicate the number of subjects assigned to have treatment 
j at center k. 

Let P = {pijk} denote the set of probabilities that a subject in center i with treatment 
j experiences outcome k, such that ^2 k Pijk = 1 for all Given these probabilities, the 
full likelihood is 

N J K 

i=i j=i k=i 
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Following Leonard (1975), we can model these probabilities using a logistic transforma- 
tion. Let 

Ei=i e w(^iji) 

Many common prior structures will maintain conditional conjugacy using the Polya-Gamma 
framework outlined thus far. For example, we may assume an exchangeable matrix-normal 
prior at the level of treatment centers: 

^ ~ N(M, E R , E c ) , 

where ipi is the matrix whose (j, k) entry is ipijk', M is the mean matrix; and E# and He are 
row- and column-specific covariance matrices, respectively. See Dawid (1981) for further 
details on matrix-normal theory. Note that, for identifiability, we set ipijK = 0, implying 
that Ec is of dimension K — 1. 

This leads to a posterior of the form 



N 



p(* | £>) = .J] 



1=1 



J K 



p^) ■ n n 



exp(V^ fc ) 

K 



Dij k 



suppressing any dependence on (M, Er, Ec) for notational ease. 

To show that this fits within the Polya-Gamma framework, we use a similar approach 

to 



Holmes and Held (|2006|), rewriting each probability as 



Pijk 



Ei^fc exp(V^-z) + exp(il) ijk ) 

^Pij k c ij k 



1 + e i/>ijk- 



where Cij k = log{Ei^fc vxpfyijl)} is implicitly a function of the other ipiji's for I ^ k. 

We now fix values of i and k and examine the conditional posterior distribution for 
tpi-k = (Alk, ipiJk)', given ipi.i for / ^ k: 



p^i.k | D, tP H -k)) OC pilfji.k | ^.(-fc)) ' II ( i + e^k-cy* 



i=l 



1 + e^fc- 



1ij Uijk 



3=1 



_J_ f^Pijk ( --ijk\'ft'ij 



This is simply a multivariate version of the same bivariate form in that arises in a 2 x 2 
table. Appealing to the theory of Polya-Gamma random variables outlined above, we may 
express this as: 



p(ipi. k i A^Hfe)) oc p(ipi. k i A-(-k)) ■ n 



p^ijk [tpijk-Cijk] 



L cosh^QV^fc - c ijk }/2) 



p(i/H.k I A-(-k)) ■ n Je"***^* -6 ** 1 .E|e"^ fe[ ^' fe - C ^ fcl2/2 } 



i=i 
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where u ijk ~ PG(n ij ,0), j = I,..., J; and K ijk = y ijk - riij/2. Given {u ijk } for j = 
1, . . . , J, all of these terms will combine in a single normal kernel, whose mean and covariance 
structure will depend heavily upon the particular choices of hyperparameters in the matrix- 
normal prior for ipi. Each uJij k term can be updated as 

{wij k | ipijk) ~ PG(nij,ip ijk - Cijk) , 

leading to a simple MCMC that loops over centers and responses, drawing each vector of 
parameters ipi. k (that is, for all treatments at once) conditional on the other Vt(-fc)'s. 

S6.3 Multinomial logistic regression 

One may extend the Polya- Gamma method used for binary logistic regression to multino- 
mial logistic regression. Consider the multinomial sample yi = {yij}j = i that records the 
number of responses in each category j = 1, . . . , J and the total number of responses nj. 
The logistic link function for polychotomous regression stipulates that the probability of 
randomly drawing a single response from the jth category in the ith. sample is 

exp ipjj 

where the log odds ipij is modeled by xjf3j and (3j has been constrained to be zero for pur- 
poses of identification. Following Holmes and Held (2006) the likelihood for f3j conditional 
upon (3~j, the matrix with column vector (3j removed, is 

i=l x ' K ' 

where 

rjij = xj(3j - Cij with dj = log ^ exp xj (3 k , 

which looks like the binary logistic likelihood previously discussed. Incorporating the Polya- 
Gamma auxiliary variable, the likelihood becomes 

N 4 

e Ki ^e--i L LJi j PG{L} ij \n i , 0) 

where Kij = (yij — nj/2). Employing the conditionally conjugate prior f3j ~ N(moj, Voj) 
yields a two-part update: 

(uij | (3j) ~ PG(rii, rjij) for i = 1, ■ • • ,N, 

where 

v- l = x'^x + v-\ 

mj = Vj (x'{Kj - SljCj) + V^moj) 
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Class 1 2 3 5 6 7 



Total 70 76 17 13 9 29 
Correct 50 55 9 9 27 

Table 16: "Correct" refers to the number of glass fragments for each category that were 
correctly identified by the Bayesian multinomial logit model. The glass identification dataset 
includes a type of glass, class 4, for which there are no observations. 



and Oj = diagdwij}^). One may sample the posterior distribution of (j3 \ y) via Gibbs 
sampling by iterating over the above steps for j = 1, . . . , J — 1. 

The Polya-Gamma method generates samples from the joint posterior distribution with- 
out appealing to analytic approximations to the posterior. This offers an important ad- 
vantage when the number of observations is not significantly larger than the number of 
parameters. 

To see this, consider sampling the joint posterior for (3 using a Metropolis-Hastings 
algorithm with an independence proposal. The likelihood in (5 is approximately normal, 
centered at the posterior mode m, and with variance V equal to the inverse of the Hessian 
matrix evaluated at the mode. (Both of these may be found using standard numerical 
routines.) Thus a natural proposal for (vec(/3^- 1 ) | y) is vec(b) ~ N(m,aV) for some a ~ 1. 
When data are plentiful, this method is both simple and highly efficient, and is implemented 
in many standard software packages (e.g. Martin et al. 2011). 

But when vec(/3) is high-dimensional relative to the number of observations the Hessian 
matrix H may be ill-conditioned, making it impossible or impractical to generate normal 
proposals. Multinomial logistic regression succumbs to this problem more quickly than 
binary logistic regression, as the number of parameters scales like the product of the number 
of categories and the number of predictors. 

To illustrate this phenomenon, we consider glass-identification data from German ( 1987). 
This data set has J = 6 categories of glass and nine predictors describing the chemical and 
optical properties of the glass that one may measure in a forensics lab and use in a crim- 
inal investigation. This generates up to 50 = 10 x 5 parameters, including the intercepts 
and the constraint that j3j = 0. These must be estimated using n = 214 observations. 
In this case, the Hessian H at the posterior mode is poorly conditioned when employing 
a vague prior, incapacitating the independent Metropolis-Hastings algorithm. Numerical 
experiments confirm that even when a vague prior is strong enough to produce a numeri- 
cally invertible Hessian, rejection rates are prohibitively high. In contrast, the multinomial 
Polya-Gamma method still produces reasonable posterior distributions in a fully automatic 
fashion, even with a weakly informative normal prior for each (5j. Table 16, which shows 
the in-sample performance of the multinomial logit model, demonstrates the problem with 
the joint proposal distribution: category 6 is perfectly separable into cases and non-cases, 
even though the other categories are not. This is a well-known problem with maximum- 
likelihood estimation of logistic models. The same problem also forecloses the option of 
posterior sampling using methods that require a unique MLE to exist. 
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